Phishing Websites Spotting with Help of using Machine Learning Tools

Authors: Prasanth Baskaran, Ronald Issac B J, Manju R, Chandru. G

DOI Link: https://doi.org/10.22214/ijraset.2023.51083

Abstract

Phishing attacks cost internet users billions of dollars every year and are a constantly growing hazard in cyberspace. Phishing is the illegal gathering of sensitive information from users through a range of social engineering techniques. It can be carried out through email, instant messaging, pop-up messages, web pages, and other forms of communication. This work offers a model that can determine whether a URL is genuine or fraudulent. The dataset used for the classification was sourced from the University of New Brunswick dataset bank, which has a collection of benign, spam, phishing, malware, and defacement URLs, as well as from an open-source service called "Phish Tank," which provides phishing URLs in multiple formats such as CSV and JSON. Phishing URLs are identified using a combination of deep neural network methods and more than six machine learning models. The goal of this study is to create a web application that can identify phishing URLs, using a dataset built from more than 5,000 randomly selected URLs of each class, in equal portions of phishing and legitimate URLs, split into 80% for training and 20% for testing. To distinguish between legitimate and phishing URLs, the models are trained and tested on address bar-based features, domain-based features, and HTML & JavaScript-based features. Finally, the study provides a model for classifying URLs as phishing or legitimate. This would be extremely useful in helping individuals and businesses identify phishing attacks by checking the validity of any link provided to them.

I. INTRODUCTION

The Internet, particularly social media, has become an important part of our lives for gathering and disseminating information. According to Pamela (2021), the Internet is a network of computers that contain valuable data; thus, many security mechanisms are in place to protect that data; however, there is a weak link: the human. Security mechanisms have a much more difficult time protecting a user's data and devices when they freely give away their data or access to their computer.

Imperva (2021) describes phishing as one of the most common social engineering attacks, used to steal user data such as login credentials and credit card numbers. The attack occurs when an attacker, posing as a trusted source, tricks a victim into opening an email, instant message, or text message. The recipient is then duped, for example into thinking they have received a gift, and clicks a malicious link, which results in the installation of malware, the freezing of the system as part of a ransomware attack, or the disclosure of sensitive information.

Computer security threats have grown significantly in recent years, owing to the rapid adoption of technological advancements, which has also increased exposure to attacks that exploit human behaviour. Phishing is therefore a rapidly evolving threat to individuals as well as large and small businesses, and users should understand how phishers operate and be aware of techniques that help protect them from being phished. Criminals now have access to industrial-strength services on the dark web, resulting in an increase in the number of phishing links and emails, as well as an increase in their 'quality,' making them more difficult to detect.

II. METHODOLOGY

An extensive review of related topics and existing documented materials, such as journals, e-books, and websites, was conducted to retrieve essential data that would help better understand and improve the system. The methodology used to achieve the previously stated goals is described below. The dataset is made up of phishing and legitimate URLs obtained from open-source platforms. The dataset was pre-processed, that is, cleaned of anomalies such as missing data, and kept balanced between the two classes. Following that, exploratory data analysis was performed on the dataset in order to explore and summarise it.

Once the dataset had been cleaned of all anomalies, website content-based features were extracted from it in order to obtain accurate features for training and testing the model. To best decide the classification models to solve the problem of detecting phishing websites, an extensive review of existing works of literature and machine learning models on detecting phishing websites was performed.

As a result, a series of machine learning classification models, including Decision Tree, Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural Network, and Random Forest, were deployed on the dataset to differentiate between phishing and legitimate URLs. Out of all the deployed models, the model with the highest accuracy was chosen and integrated into a web application, so that a user can enter a URL into the web application to determine whether it is phishing or legitimate.

III. LITERATURE REVIEW

According to Ankit and Gupta (2017), citing Internet World Stats ("Internet world stats usage and population statistics", 2014), the total number of Internet users worldwide in 2014 was 2.97 billion; that is, more than 38% of the world population used the Internet. Hackers exploit insecure Internet systems to trick unsuspecting users into falling for phishing scams. On the Internet, phishing emails are used to defraud both individuals and financial institutions ("RSA Anti-Fraud Command Centre", n.d.). According to its website, the Anti-Phishing Working Group (APWG) is an international consortium dedicated to promoting research, education, and law enforcement in order to eliminate online fraud and cybercrime.

Total phishing attacks increased by 160% in 2012 compared to 2011, indicating a record year for phishing volumes. In 2013, approximately 450,000 phishing attacks were detected, resulting in financial losses of more than $5.9 billion ("RSA Anti-Fraud Command Centre", n.d.).

In 2013, total attacks increased by 1% over 2012. The total number of phishing attacks detected in the first quarter of 2014 was 125,215, representing a 10.7 percent increase over the fourth quarter of 2013. To fool users, more than 55% of phishing websites include the target site's name in some form, and 99.4% of phishing websites use port 80 ("Anti-Phishing Working Group (APWG) Phishing activity trends report first quarter", 2014).

According to an APWG report published in the first quarter of 2014, the second-highest number of phishing attacks ever recorded occurred between January and March 2014 ("Anti-Phishing Working Group (APWG) Phishing activity trends report first quarter", 2014), with payment services being the most targeted industry. In the second half of 2014, 123,972 unique phishing attacks were observed ("APWG report", 2014). Total financial losses in 2011 were 1.2 billion dollars, and they increased to 5.9 billion dollars in 2013.

IV. PROPOSED SYSTEM

This chapter describes the various processes, methods, and procedures used by the researcher to achieve the stated goals and objectives, as well as the conceptual framework in which the research was carried out.

Any research work's methodology refers to the research approach taken by the researcher to address the stated problem. It matters because the efficiency and maintainability of any application are determined by how its design is created. This chapter contains detailed descriptions of the methods used to meet the research work's stated objectives.

System analysis, according to Merriam-Webster (11th edition), is "the process of studying a procedure or business to identify its goals and purposes and create systems and procedures that will efficiently achieve them." It is also the act, process, or profession of studying an activity (such as a procedure, a business, or a physiological function) typically through mathematical means in order to define its goals or purposes and discover operations and procedures for most efficiently accomplishing them. System analysis is used in every field where something is created.

Prior to planning and development, you must thoroughly understand the old systems and use that knowledge to determine how well your new system can function.

Machine learning models and deep neural networks are used in the proposed phishing detection system. The machine learning models and a web application are the two main components of the system. Decision Tree, Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural Network, and Random Forest are among the models used.

These models were chosen based on the comparative performance of various machine learning algorithms. Each of these models is trained and tested on content-based website features extracted from both the phishing and legitimate datasets.

As a result, the model with the highest accuracy is chosen and integrated into a web application that allows users to predict whether a URL link is phishing or legitimate.

V. MODEL DEVELOPMENT

The model development method starts with several models, tests them, and then adds them to an iterative process until a model that meets the requirements is created. Figure 5.1 depicts the steps involved in the development of supervised and unsupervised machine learning models.

The following are the stages of machine learning model development for the phishing detection system:

A. Data Collection

The data on which the models are trained is obtained from various open-source platforms. The dataset includes phishing and legitimate URLs. The phishing URLs were gathered from Phish Tank, an open-source service. This service provides a collection of phishing URLs in various formats such as CSV, JSON, and others, and is updated hourly. This dataset can be accessed via the phishtank.com website. Over 5000 random phishing URLs are collected from this dataset to train the ML models. The set of legitimate URLs was obtained from the University of New Brunswick's open datasets, which can be found on the university's website. This dataset contains URLs that are benign, spam, phishing, malware, or defacement. Of all these types, only the benign URLs are used in this project. Over 5000 random legitimate URLs are collected from this dataset to train the ML models.
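The sampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_balanced_dataset`, the placeholder URL pools, and the fixed seed are assumptions, and the 0/1 label convention follows the one stated later in the results section.

```python
import random

LEGITIMATE, PHISHING = 0, 1  # two-class labels: legitimate (0), phishing (1)


def build_balanced_dataset(phishing_urls, legitimate_urls, n_per_class, seed=42):
    """Draw n_per_class random URLs from each source and attach class labels."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw repeatable
    phishing = rng.sample(phishing_urls, n_per_class)
    legitimate = rng.sample(legitimate_urls, n_per_class)
    rows = [(url, PHISHING) for url in phishing]
    rows += [(url, LEGITIMATE) for url in legitimate]
    rng.shuffle(rows)  # avoid all phishing rows coming first
    return rows


# Hypothetical stand-ins for the Phish Tank and University of New Brunswick URL lists.
phishing_pool = [f"http://phish-{i}.example/login" for i in range(50)]
legitimate_pool = [f"https://site-{i}.example/" for i in range(80)]

dataset = build_balanced_dataset(phishing_pool, legitimate_pool, n_per_class=20)
```

Drawing the same number of URLs from each source is what keeps the two classes in equal portions, whatever the sizes of the original pools.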

B. Pre-processing

Following data collection, the first and most important step is data pre-processing. The raw dataset for phishing detection was prepared by removing redundant and irregular data and then encoded into a useful and efficient format suitable for the machine learning model using the One-Hot Encoding technique.
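A small sketch of what this cleaning and encoding could look like is given below. The paper does not say which columns were one-hot encoded, so the `scheme` column, the sample rows, and the helper names `clean_rows` and `one_hot` are purely illustrative assumptions.

```python
def clean_rows(rows):
    """Drop rows with missing values and exact duplicates (first occurrence kept)."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue  # irregular row: at least one missing value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # redundant row: already kept once
        seen.add(key)
        cleaned.append(row)
    return cleaned


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical raw rows; the column names are assumptions for illustration only.
raw = [
    {"url": "http://a.example", "scheme": "http", "label": 1},
    {"url": "http://a.example", "scheme": "http", "label": 1},    # duplicate
    {"url": "https://b.example", "scheme": "https", "label": 0},
    {"url": "https://c.example", "scheme": None, "label": 0},      # missing value
]
prepared = one_hot(clean_rows(raw), "scheme")
```

In a pandas workflow the same steps would typically be written with `drop_duplicates`, `dropna`, and `get_dummies`.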

C. Exploratory Data Analysis

After a series of data cleaning steps, the dataset was subjected to exploratory data analysis (EDA). The data visualisation method was used to analyse, explore, and summarise the dataset. These visualisations include heat maps, histograms, box plots, scatter plots, and pair plots to uncover patterns and insights within data.

D. Feature Extraction

The goal of feature extraction is to reduce the number of features in a dataset by generating new ones from existing ones. Thus, website content-based features such as the Address bar-based feature, which consists of 9 features, the Domain-based feature, which consists of 4 features, and the HTML & JavaScript-based feature, which consists of 4 features, were extracted from phishing and legitimate datasets. As a result, a total of 17 features were extracted for phishing detection.
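To make the address bar-based category concrete, the sketch below derives a handful of URL features using only the Python standard library. The paper does not enumerate its nine address bar features, so the specific features and the function name `address_bar_features` are assumptions chosen because they are common in phishing-detection work, not a reproduction of the authors' feature set.

```python
import ipaddress
from urllib.parse import urlparse


def address_bar_features(url):
    """Return a few illustrative address bar-based features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # 1 if the host is a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        has_ip = 1
    except ValueError:
        has_ip = 0

    # Number of non-empty path segments, e.g. /a/b/c -> 3.
    depth = len([segment for segment in parsed.path.split("/") if segment])

    # "//" appearing after the scheme separator can indicate a redirect.
    rest = url.split("://", 1)[-1]

    return {
        "has_ip": has_ip,
        "has_at": int("@" in url),
        "url_length": len(url),
        "depth": depth,
        "redirect": int("//" in rest),
        "hyphen_in_domain": int("-" in host),
    }


ip_example = address_bar_features("http://192.168.0.1/login")
odd_example = address_bar_features("http://user@evil-site.example//a/b")
plain_example = address_bar_features("https://www.example.com/")
```

Domain-based and HTML & JavaScript-based features need network lookups or page content, so they are left out of this offline sketch.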

E. Model Training

Model training entails feeding data to machine learning algorithms so that they can identify and learn useful attributes of the dataset. This research problem is a supervised learning task and, more specifically, a classification problem. The phishing detection algorithms trained on the dataset include supervised machine learning models and a deep neural network: Decision Tree, Random Forest, Support Vector Machine, XGBoost, Multilayer Perceptron, and Autoencoder Neural Network. All of these models were trained on the same dataset, which is divided into two parts: training and testing. The training set contains 80% of the dataset, allowing the machine learning models to learn more about the data and distinguish between phishing and legitimate URLs.
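The 80/20 split and the training loop can be sketched with scikit-learn as below. Synthetic data from `make_classification` stands in for the real 17-feature URL dataset, only two of the six models are shown, and the hyperparameters and random seeds are assumptions rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features: 1,000 rows, 17 columns, 2 classes.
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=8, random_state=0
)

# 80% training, 20% testing; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters here are illustrative assumptions.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=0
    ),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                        # learn from the 80% split
    test_accuracy[name] = model.score(X_test, y_test)  # accuracy on the held-out 20%
```

Holding the test rows out of `fit` is what makes the resulting accuracy an estimate of performance on unseen URLs.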

F. Model Testing

Model testing is the process of evaluating the performance of a fully trained model on a testing set.

As a result, after the models have been trained on 80% of the data, the remaining 20% of the dataset is used to evaluate the trained models and see how well they perform.

G. Model Assessment

Model evaluation entails estimating how accurately a model generalises to unseen data and deciding whether it performs well enough.

Thus, the scikit-learn metrics module (sklearn.metrics), which implements several score and utility functions for measuring classification performance, was used to properly evaluate the phishing detection models.
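For readers unfamiliar with these scores, the sketch below writes out by hand what the basic classification metrics compute. It is a plain-Python illustration with made-up labels; in practice the equivalent functions in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) would be used, and the paper only states that the accuracy score was reported.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as phishing."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def basic_scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a two-class problem."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
scores = basic_scores(y_true, y_pred)
# Here 6 of 8 predictions are right, so accuracy is 0.75.
```

Accuracy alone can hide a model that misses many phishing URLs, which is why recall on the phishing class is worth checking alongside it.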

VI. SYSTEM MODELLING

Figure 6.1 depicts the architecture of the proposed phishing detection system, in which a user enters a URL link, which is then passed through various trained machine learning and deep neural network models until the best model with the highest accuracy is chosen. The chosen model is deployed as an API (Application Programming Interface) and integrated into a web application. A user interacts with the web application, which is available on various display devices such as computers, tablets, and mobile devices.
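The request path in this architecture (URL in, features extracted, model queried, verdict returned) can be sketched as a small wrapper class. Everything named here is hypothetical: `UrlClassifier`, the toy feature function, and the stand-in rule model are placeholders showing the shape of the flow, not the deployed API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

LABELS = {0: "legitimate", 1: "phishing"}


@dataclass
class UrlClassifier:
    """Glue between a feature extractor and a trained model."""

    extract: Callable[[str], Sequence[float]]
    model: object  # any object with a scikit-learn style predict(rows) method

    def classify(self, url: str) -> dict:
        features = list(self.extract(url))
        label = int(self.model.predict([features])[0])
        return {"url": url, "label": label, "verdict": LABELS[label]}


class AtSignRule:
    """Stand-in for the chosen model: flags a URL when the first feature is 1."""

    def predict(self, rows):
        return [1 if row[0] == 1 else 0 for row in rows]


def toy_features(url):
    # Hypothetical two-feature extractor used only for this demo.
    return [int("@" in url), len(url)]


service = UrlClassifier(extract=toy_features, model=AtSignRule())
result = service.classify("http://login@bank.example/verify")
```

Because the wrapper only relies on a `predict` method, the stand-in rule could be swapped for whichever trained model scored highest without changing the web-facing code.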

VII. SYSTEM IMPLEMENTATION AND RESULTS

A. Data Collection

The dataset used for classification was obtained from a variety of sources, as detailed in the methodology.

The URLs to be classified as phishing or legitimate were obtained from open-source websites, with examples shown in figures 7.1 and 7.2 below.
(Source: The legitimate URLs were obtained from the open datasets of the University of New Brunswick. The dataset consists of collections of benign, spam, phishing, malware, and defacement URLs. Of all these types, only the benign URL dataset is used in this project. It provides 5,000 random legitimate URLs, which are collected to train the ML models.)

1. Feature Extraction on the Datasets

The features extracted from the dataset fall into three categories:

a. Address bar-based features

b. Domain-based features

c. HTML & JavaScript-based features

2. Data Analysis & Visualization

Figure 8.1 depicts a distribution plot of how legitimate and phishing datasets are distributed based on the features chosen and how they are related to one another.

Figure 8.2 depicts a correlation heat-map of the dataset. The plot depicts the relationship between various variables in the dataset.

Figure 8.3 depicts the feature importance in the model for the Decision Tree classifier and the Random Forest classifier, respectively.

3. Phishing Detection Model

According to the methodology, the proposed system makes use of machine learning models and deep neural networks. Decision Tree, Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural Network, and Random Forest are among the models used. The models determine whether a website URL is legitimate or phishing, providing a two-class prediction (legitimate (0) and phishing (1)). Over six (6) machine learning models and deep neural network algorithms were used in the model development process to detect phishing URLs, working in Jupyter Notebook with packages such as pandas, Beautiful Soup, whois, urllib, and others. The models are shown in figure 8.4, and their accuracy was measured with the accuracy score from sklearn.metrics. The XGBoost model achieved the highest performance score of 86.6%, followed by the Multilayer Perceptron model at 86.5%, the Random Forest model at 81.8%, the Decision Tree model at 81.4%, the Support Vector Machine model at 80.4%, and the Autoencoder Neural Network model at 16.1%.
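The selection of the best model from these reported scores amounts to a simple ranking, sketched below. The accuracy values are the ones reported in this section; the model names are written in their standard spellings, and the variable names are illustrative.

```python
# Accuracy scores (in percent) as reported in this section.
reported_accuracy = {
    "XGBoost": 86.6,
    "Multilayer Perceptron": 86.5,
    "Decision Tree": 81.4,
    "Random Forest": 81.8,
    "Support Vector Machine": 80.4,
    "Autoencoder Neural Network": 16.1,
}

# Sort from highest to lowest accuracy and pick the top entry.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranking[0]

for name, score in ranking:
    print(f"{name:<28}{score:5.1f}%")
```

The gap between the top two models is only 0.1 percentage points, so on a 20% test split the ordering between them may not be stable across different random splits.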

4. General Working of the System

"Phish-BusterV2," a one-page phishing detection web application, has been developed to run on any browser. HTML, CSS, PHP, and JavaScript were used in the development of the application.

The following pages are available in the phishing detection web application:

a) The Home Page

The home page includes a section in which a user can enter a URL and predict whether it is phishing or legitimate.

Figure 8.5 shows how it predicts the state of the URL based on the feature selection. The goal of this page is to assist users in validating a URL link as well as to provide various resources on phishing attacks. A Google phishing test can also be taken to help the user understand how to detect phishing messages and URLs. Users can also download a book containing information and other resources on phishing.

Conclusion

A. Summary

Phishing attacks are a rapidly growing cyber threat that costs internet users billions of dollars each year. Phishing entails employing a variety of social engineering techniques to obtain sensitive information from users, and it can be carried out through a variety of modes of communication, such as email, instant messaging, pop-up messages, and web pages. This project was able to categorise and recognise how phishers carry out phishing attacks, as well as the various ways in which researchers have assisted in the detection of phishing. The proposed system for this project used various feature selection, machine learning, and deep neural network techniques to identify patterns, including Decision Tree, Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural Network, and Random Forest. The model with the highest accuracy at distinguishing phishing URLs from legitimate URLs, based on the extracted features, was integrated into a web application where users can enter website URLs to determine whether they are legitimate or phishing.

B. Conclusion

Using machine learning models and deep neural network algorithms, the system developed determines whether a URL link is phishing or legitimate. The feature extraction and models used on the dataset aided in the unique identification of phishing URLs, as well as the performance accuracy of the models used. It is also surprisingly accurate at determining the legitimacy of a URL link.

C. Recommendation

Through this project, one can learn a lot about phishing attacks and how to prevent them. This project can be taken further by creating a browser extension that can be installed on any web browser to detect phishing URL links.

References

[1] Abdelhamid, N., Thabtah F., & Abdel-Jaber, H. Phishing detection: A recent intelligent machine learning comparison based on models’ content and features. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, 2017, pp. 72-77, DOI: 10.1109/ISI.2017.8004877.
[2] Anjum N. S., Antesar M. S., & Hossain M.A. (2016). A Literature Review on Phishing Crime, Prevention Review and Investigation of Gaps. Proceedings of the 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA), Chengdu, China, 2016, pp. 9-15, DOI: 10.1109/SKIMA.2016.7916190.
[3] Almomani, A., Gupta, B. B., Atawneh, S., Meulenberg, A., & Almomani, E. (2013). A survey of phishing email filtering techniques, Proceedings of IEEE Communications Surveys and Tutorials, vol. 15, no. 4, pp. 2070–2090.
[4] Ashritha, J. R., Chaithra, K., Mangala, K., & Deekshitha, S. (2019). A Review Paper on Detection of Phishing Websites using Machine Learning. Proceedings of International Journal of Engineering Research & Technology (IJERT), 7, 2. Retrieved from www.ijert.org.
[5] Anti-Phishing Working Group (APWG) Phishing activity trends report the first quarter. (2014). Retrieved from http://docs.apwg.org/reports/apwg trends report q1 2014.pdf.
[6] APWG report. (2014). Retrieved from http://apwg.org/download/document/245/APWG Global Phishing Report 2H 2014.pdf.
[7] Ayush, P. (2019). Workflow of a Machine Learning project. Retrieved from https://towardsdatascience.com/workflow-of-a-machine-learning-projectec1dba419b9.
[8] Camp W. (2001). Formulating and evaluating theoretical frameworks for career and technical education research. Journal of Vocational Education Research, 26(1), 4-25.
[9] DeepAI (n.d.). About clinical psychology. Retrieved from https://deepai.org/machinelearning-glossary-and-terms/feature-extractio
[10] Engine K., & Christopher K. (2005). Protecting Users Against Phishing Attacks. Proceedings of the Oxford University Press on behalf of The British Computer Society, Oxford University, 0, 2005, Retrieved from: https://sites.cs.ucsb.edu/~chris/research/doc/cj06_phish.pdf
[11] Gandhi, V. (2017). A Theoretical Study on Different ways to identify the Phishing URL and Its Prevention Approaches: presented at International Conference on Cyber Criminology, Digital Forensics and Information Security at DRBCCC Hindu College, Chennai. Retrieved from https://www.researchgate.net/publication/319006943_A_Theoretical_Study_on_Different_ways_to_Identify_the_Phishing_URL_and_Its_Prevention_Approaches
[12] Gupta, B. B., Tewari, A., Jain, A. K., & Agrawal, D. P. (2016). Fighting against phishing attacks: state of the art and future challenges, Neural Computing and Applications. Internet world stats usage and population statistics. (2014). Retrieved from http://www.internetworldstats.com/stats.htmL.
[13] Imperva. (2021). Phishing attacks. Retrieved from https://www.imperva.com/learn/application-security/phishing-attack-scam/
[14] Kiruthiga, R., Akila, D. (2019, September). Phishing Websites Detection Using Machine Learning. Retrieved from https://www.researchgate.net/publication/337049054 Phishing Websites Detection Using_Machine_Learning.
[15] KnowBe4 (2021). Phishing Techniques. Retrieved from https://www.phishing.org/phishing-techniques
[16] Kondeti, P. S., Konka, R. C., & Kavishree, S. (2021). Phishing Websites Detection using Machine Learning Techniques. International Research Journal of Engineering and Technology, 08(4), Page 1471-1473. Retrieved from https://www.irjet.net/archives/V8/i4/IRJET-V8I4274.pdf
[17] Noel, B. (2016). Support Vector Machines: A Simple Explanation. Retrieved from https://www.kdnuggets.com/2016/07/support-vector-machines-simpleexplanation.html
[18] Osanloo, A., & Grant, C. (2016). Understanding, selecting, and integrating a theoretical framework in dissertation research: creating the blueprint for your “house”. Administrative issues journal: connecting education, practice and research 4(2), 7
[19] Peng, T., Harris, I., & Sawa, I. (2018). Detecting Phishing Attacks Using Natural Language Processing and Machine Learning. Proc. - 12th IEEE Int. Conf. Semant. Comput. ICSC 2018, vol. 2018–Janua, pp. 300–301.
[20] Pamela (2021). Phishing attacks. Retrieved from https://www.khanacademy.org/computing/computers-and internet/xcae6f4a7ff015e7d:online-data-security/xcae6f4a7ff015e7d:cyberattacks/a/phishing-attack.
[21] Rami, M. M., Fadi, T., & Lee, M. (2015). Phishing Websites Features. Retrieved from https://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf
[22] Rishikesh, M., & Irfan, S. (2018a). Phishing Website Detection using Machine Learning Algorithms. International Journal of Computer Applications, 23, 45. doi:10.5120/ijca2018918026
[23] Rishikesh, M., & Irfan, S. (2018b). Phishing Website Detection using Machine Learning Algorithms. International Journal of Computer Applications, 23, 45-46. doi:10.5120/ijca2018918026.
[24] Rahul, S. (2017). How the decision tree algorithm works. Retrieved from https://dataaspirant.com/how-decision-tree-algorithm-works/

Copyright

Copyright © 2023 Prasanth Baskaran, Ronald Issac B J, Manju R, Chandru. G. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET51083

Publish Date : 2023-04-26

ISSN : 2321-9653

Publisher Name : IJRASET